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TESTING SIGNIFICANCE OF FEATURES BY LASSOED 
PRINCIPAL COMPONENTS 

By Daniel a M. Witten 1 and Robert Tibshirani 2 

Stanford University 

We consider the problem of testing the significance of features 
in high-dimensional settings. In particular, we test for differentially- 
expressed genes in a microarray experiment. We wish to identify genes 
that are associated with some type of outcome, such as survival time 
or cancer type. We propose a new procedure, called Lassoed Princi- 
pal Components (LPC), that builds upon existing methods and can 
provide a sizable improvement. For instance, in the case of two-class 
data, a standard (albeit simple) approach might be to compute a two- 
sample t-statistic for each gene. The LPC method involves projecting 
these conventional gene scores onto the eigenvectors of the gene ex- 
pression data covariance matrix and then applying an ii penalty in 
order to de-noise the resulting projections. We present a theoretical 
framework under which LPC is the logical choice for identifying sig- 
nificant genes, and we show that LPC can provide a marked reduction 
in false discovery rates over the conventional methods on both real 
and simulated data. Moreover, this flexible procedure can be applied 
to a variety of types of data and can be used to improve many existing 
methods for the identification of significant features. 

1. Introduction. In recent years new experimental technologies within 
the field of biology have led to data sets in which the number of features p 
greatly exceeds the number of observations n. Two such examples are gene 
expression data and data from genome- wide association studies. In the case 
of gene expression (or microarray) data, it is often of interest to identify 
genes that are differentially-expressed across conditions (for instance, some 
patients may have cancer and others may not), or that are associated with 
some type of outcome, such as survival time. Such genes might be used as 
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features in a model for prediction or classification of the outcome in a new 
patient, or they might be used as target genes for further experiments in 
order to better understand the biological processes that contribute to the 
outcome. 

Over the years, a number of methods have been developed for the iden- 
tification of differentially-expressed genes in a microarray experiment; for a 
review, see Cui and Churchill (2003) or Allison et al (2006). Usually, the as- 
sociation between a given gene and the outcome is measured using a statistic 
that is a function of that gene only. Genes for which the statistic exceeds 
a given value are considered to be differentially-expressed. Many methods 
combine information, or borrow strength, across genes in order to make a 
more informed assessment of the significance of a given gene. In the case of 
a two-class outcome, such methods include the shrunken variance estimates 
of Cui et al. (2005), the empirical Bayes approach of Lonnstedt and Speed 
(2002), the limma procedure of Smyth (2004) and the optimal discovery 
procedure (ODP) of Storey, Dai and Leek (2007). We elaborate on the lat- 
ter two procedures, as we will compare them to our method throughout 
the paper in the case of a two-class outcome. Limma assesses differential 
expression between conditions by forming a moderated t-statistic in which 
posterior residual standard deviations are used instead of the usual standard 
deviation. The ODP approach is quite different; it involves a generalization 
of the likelihood ratio statistic such that an individual gene's significance is 
computed as a function of all of the genes in the data set. In the case of a 
survival outcome, a standard method to assess a gene's significance (and the 
one to which we will compare our proposed method in this paper) is the Cox 
score; see, for example, Beer et al. (2002) and Bair and Tibshirani (2004). 

We propose a new method, called Lassoed Principal Components (LPC), 
for the identification of differentially-expressed genes. The LPC method is 
as follows. First, we compute scores for each gene using an existing method, 
such as those mentioned above. These scores are then regressed onto the 
eigenarrays of the data [Alter, Brown and Botstein (2000)] — principal com- 
ponents of length equal to the number of genes — with an L\ constraint. The 
resulting fitted values are the LPC scores. In this paper we demonstrate that 
LPC scores can result in more accurate gene rankings than the conventional 
methods, and we present theoretical justifications for the use of the LPC 
method. 

Our method has two main advantages over existing methods: 

1. LPC borrows strength across genes in an explicit manner. This benefit is 
rooted in the biological context of the data. In biological systems genes 
that are involved in the same biological process, pathway, or network 
may be co-regulated; if so, such genes may exhibit similar patterns of 
expression. One would not expect a single gene to be associated with 
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the outcome, since, in practice, many genes work together to effect a 
particular phenotype. LPC effectively down-weights individual genes that 
are associated with the outcome but that do not share an expression 
pattern with a larger group of genes, and instead favors large groups of 
genes that appear to be differentially-expressed. By implicitly using prior 
knowledge about what types of genes one expects to be differentially- 
expressed, LPC achieves improved power over existing methods in many 
examples. 

2. LPC can be applied on top of any existing method (regardless of outcome 
type) in order to obtain potentially more accurate measures of differen- 
tial expression. For instance, in the case of a two-class outcome, many 
methods to detect differentially-expressed genes exist, including SAM 
[Tusher, Tibshirani and Chu (2001)], limma [Smyth (2004)] and ODP 
[Storey, Dai and Leek (2007)], as mentioned earlier. By projecting any 
of these methods' gene scores onto the eigenarrays of the data with an 
L\ constraint, we harness the power of these methods while incorporating 
the biological context. 

The idea behind this method is that a gene that resembles the outcome is 
more likely to be significant if it is one of many genes with similar expression 
patterns than if it resembles no other genes. From a biological standpoint, 
this is due to the hypothesis that a gene that truly is associated with an 
outcome (such as cancer) will be involved in a biological pathway related 
to the outcome that involves many genes. Other genes in the pathway may 
exhibit similar expression patterns due to co-regulation. From a statistical 
standpoint, it is due to the fact that while variance in the genes' expres- 
sion levels may occasionally cause an individual nonsignificant gene to be 
correlated with the outcome by chance, it is statistically quite unlikely that 
a great number of genes will all be correlated with the outcome and with 
each other due solely to random noise. By regressing the conventional gene 
scores onto the eigenarrays of the gene expression data and using the fitted 
values to rank genes, we essentially only rank highly the genes that have 
moderate to high gene scores and have large loadings in an eigenarray that 
is correlated with the vector of gene scores. Thus, individual genes with 
expression patterns that do not resemble those of other genes in the data 
set are not given high rankings by our method. Genes with moderate scores 
that resemble a large block of genes with high scores are given high LPC 
scores; they borrow strength from genes with similar expression profiles. 

The LPC method bears similarities to the surrogate variable analysis 
(SVA) method of Leek and Storey (2007). SVA attempts to adjust for ex- 
pression heterogeneity among samples, whereas we try to exploit hetero- 
geneity in order to obtain more accurate gene scores. The effects of the two 
methods are quite different, and the methods can be used together, as is 
shown in Appendix D.2. 
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Fig. 1. Heatmap 0} simulated data set; blue signifies low and yellow signifies high gene 
expression. There are 50 genes (rows) and 10 patients (columns); 10 of the genes (shown 
on the bottom of the heatmap ) are associated with the outcome. Patients are ordered from 
oldest to youngest. The binary outcome for each patient is shown below the heatmap. The 
alternating pattern of the outcome can be seen in the significant genes in the heatmap, but 
the predominant pattern seen in these genes is associated with age. 

A simulated two-class example helps to illustrate how LPC can outper- 
form standard approaches. Suppose that expression profiles come from either 
cancer or normal tissue, and that genes over-expressed in cancer also hap- 
pen to be under-expressed in older individuals (Figure 1, see Supplementary 
Materials for R language code for this simulation [Witten and Tibshirani 
(2008)]). The expression of these genes is a function of both patient age 
and tissue type. If it is known by the data analyst that age affects gene 
expression, then age can be used as a covariate in determining whether a 
gene is differentially-expressed. However, in practice, factors that affect gene 
expression are often unknown or unmeasured, and so are not included as co- 
variates. In this case, a two-sample t-statistic will have limited success in 
identifying the genes associated with cancer type, because the age effect 
masks some of the correlation between cancer type and gene expression. On 
the other hand, applying the LPC method to the two-sample t-statistics 
results in high scores for the differentially-expressed genes, because these 
genes will have high loadings on the eigenarray that is most correlated with 
the cancer type. The resulting gene scores can be seen in Figure 2. 

LPC is not restricted to the two-class problem, and findings in the context 
of survival outcomes indicate its potential promise. Figure 3 shows the es- 
timated false discovery rates in detecting genes in lymphoma patients that 
are associated with altered survival. LPC clearly outperforms a standard 
gene-specific analysis based on Cox scores (see Section 3.3). 

The paper is organized as follows. In Section 2 we present the details of 
the LPC method, as well as some theoretical results that justify its use. 
Then, in Section 3 we demonstrate by example that LPC outperforms the 



LASSOED PRINCIPAL COMPONENTS 



5 



conventional gene scores in simulations and on published microarray data 
sets for two-class and survival outcomes. We also present a method for false 
discovery rate estimation for LPC, which is given in greater detail in Ap- 
pendix C. Section 4 contains the Discussion. 
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Fig. 2. Genes that truly are differentially-expressed are shown in red. The two-sample 
t-statistic does not assign very high scores to the differentially-expressed genes, because 
the confounding effect of age reduces the association between these genes and the outcome. 
LPC (based on the two-sample t-statistic) assigns high scores to all of the significant genes. 
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Fig. 3. Estimated false discovery rates are shown for the lymphoma data set 
[Rosenwald et al. (2002)], which consists of 7,399 gene expression measurements on 240 
patients. The false discovery rate of Cox scores is shown in black, and that of LPC is 
shown in red. 
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Table 1 

Simple scores for various outcomes 



Outcome type 



Gene score 



Quantitative Standardized regression coefficient for regression of y onto Xj 

Survival Score statistic for univariate Cox proportional hazards model; 

see, for example, Beer et al. (2002) and Bair and Tibshirani (2004) 
Two-class Two-sample t-statistic 

Multiple-class F-statistic for one-way ANOVA 



2. The LPC method. 

2.1. Description. Let X denote aniixp matrix of log transformed gene 
expression levels, where n is the number of observations and p is the number 
of genes, and n <p. Assume that the columns of X (the genes) have been 
centered so that they have mean zero across all of the observations. Let Xj 
denote the vector of expression levels for gene j. Let y denote a vector of 
length n containing the outcomes for each observation. For instance, if this 
is a two-class problem, then y will be a binary vector. 

The LPC method involves using existing gene scores to develop LPC 
scores that aim to provide a more accurate ranking of genes in terms of 
differential-expression. In principle, a wide variety of gene scores could be 
used; however, in the simplest version of LPC one would use one of the 
methods in Table 1, depending on the outcome variable. In the examples 
analyzed, a small constant is added to the denominators of the gene scores 
in Table 1 in order to avoid large ratios resulting from small estimated 
standard deviations; see, for example, Tusher, Tibshirani and Chu (2001). 
We will refer to the statistics in Table 1 as T. For ease of notation, unless 
we specify otherwise, the LPC scores discussed in this paper are formed by 
applying the LPC method to these T scores. Tj will refer to the gene score 
for gene j. The LPC method is as follows for the quantitative, survival and 
two-class cases: 
LPC Algorithm: 

1. Compute T, the vector of length p with components Tj, the score for 
gene j, for j el,..., p. 

2. Compute vi, . . . , v n , the eigenarrays of X, where the Vj's are the columns 
of the matrix V in the singular value decomposition (SVD) of X, X = 
UDV T . 

3. For some value of the parameter A, fit the model T = /3 + Ya=i A v «j 
where the vector (3 with components (3i is chosen to minimize the quan- 
tity (T - (3 - V/3) T (T -0 O - V/3) + A£?=i |AI- This is multiple linear 
regression with an L\ constraint, also known as the "lasso" [Tibshirani 
(1996)]. 
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4. Let T denote the fitted values obtained by the above model. The LPC 
score for gene j is Tj. 

In the case of a multiple-class response, the procedure is slightly different, 
and is presented in Appendix B. 

In Step 3 of the LPC algorithm, fitting a linear model with an L\ con- 
straint is very fast, because we are regressing the scores T on the columns 
of V, which are orthogonal. We use the following soft thresholding approach 
in order to obtain the lasso coefficients: 
Soft-Thresholding Algorithm: 

1. Compute 0, the vector of coefficients obtained by regressing T on the 
eigenarrays of X using ordinary multiple least squares; that is, = 
argmin /3 (T - /3 - £?=i Vj/3j) T (T -0 O - £™ =1 Vi fl). 

2. Let /3 i = sign(/3i)(|A|-|)+,_V?el,...,n. 

3. Compute T = O + Ya=i v iA; these are the LPC scores. 

The LPC algorithm involves a shrinkage parameter, A, which determines 
the amount of regularization performed in the Li-constrained regression. 
An automated method for the selection of the value for this parameter is 
presented in Appendix A. 

Returning to the example from the Introduction, the value of the shrink- 
age parameter A chosen by our automated method was 5.5. This resulted in 
Pi nonzero and @i = for i G 2, . . . , n. The first eigenarray is associated with 
the response. In this example, LPC's success stems from the fact that the L\ 
constraint resulted in a nonzero coefficient only for the correct eigenarray. 

X T y 

In the case of a quantitative response, the T scores take the form — , J 

for gene j. Suppose that the genes have been scaled appropriately so that 
the T scores are simply X T y. From the LPC Algorithm and the Soft- 
Thresholding Algorithm, the LPC scores are given by the formula T = 
0q + Ya=i v iPii where the columns of V are linear combinations of the rows 
of X. Therefore, if A = (i.e., in the absence of an L\ constraint), the LPC 
scores equal the T scores exactly. This means that T is a special case of 
LPC. This leads us to hope that if, on a given data set, T outperforms LPC 
with nonzero A, our adaptive method of choosing A will set A to zero. If this 
is the case, then we will always end up using the approach that works best 
on a particular data set. A similar result holds for the case of a two-class 
response. Note that, in practice, however, one usually does not scale the 
genes as described here. 

2.2. Motivating LPC via an underlying latent variable model. Consider 
a scenario in which a subset of genes is associated with the outcome because 
some underlying process, or "latent variable," affects both the expression of 
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the genes and the outcome measurements. In Appendix D.l, it is shown that 
in such a situation, under suitable assumptions, LPC scores will have lower 
variance than T scores. This justifies the use of LPC in situations where a 
latent variable model could reasonably describe the data set of interest. 

2.3. Relationship with the eigengene space. In microarray data analysis 
the principal components of the columns of X are referred to as the eigen- 
genes, and the principal components of the rows of X are referred to as the 
eigenarrays. We are interested in identifying significant genes; therefore, it 
may seem peculiar that our proposed method works in the space of eigenar- 
rays rather than in the space of eigengenes. For instance, Bair and Tibshirani 
(2004) and Bair et al. (2006) perform supervised principal components anal- 
ysis in the eigengene space. We show here that, in the simple case of a quan- 
titative outcome, working in the eigenarray space is quite similar to working 
in the eigengene space, but has a distinct advantage. 

Assume that the columns of X are centered and that the quantitative 
outcome y is centered. Let X = UDV T denote the singular value decompo- 
sition for X. Let T denote the vector of T statistics. The LPC method fits 
the linear model with L\ constraint on the coefficients 

n 

(2.1) T = V/3 + e = + 

i=l 

where E(e) = and Vj is the ith right singular vector of the gene expression 
data; in other words, it is the ith eigenarray of the original data X. 

Now, suppose that instead of using the conventional T scores (which, in 
this case, would be the standardized regression coefficients), we take the 
inner product of each gene Xj with the vector y. (If the genes of X were 
scaled appropriately before computing the usual T scores, then the usual T 
scores would be equivalent to these simplified scores.) The above equation 
gives 

T = X T y 

= V(3 + e 

(2-2) 

= X T UD~ 1 /3 + £ 
= X T U0 + e, 

letting 6 be a vector of length n with components #j = where di is the 
ith. diagonal element of D. 

Assuming that A = (so we are simply performing multiple least squares 
regression), then 

6 = argmin||X T y - X T U6>|| 2 
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(2.3) = argmin||X T (y-U0)|| 2 

= argmin||DU T (y - U0)|| 2 . 
Note that if XX T = UD 2 U T = I, then 

(2.4) 6 = argmin||y - U0|| 2 . 

Now, suppose that we instead regress y on the columns of U: 

y = $>^+ £ ' 

i=i 

= U0'. 

Here, if A = 0, then 0' is again given by (2.4). 

So, if A = 0, then regressing the scores on the eigenarrays is quite similar 
to regressing the outcome on the eigengenes. In fact, we have shown that if 
XX T = I, then the resulting coefficients are identical up to a scaling by the 
matrix D of the SVD. 

In general, when A = 0, the solution to the LPC least squares problem is 
the that minimizes 

(2.6) ||X r (y-U0)|| 2 = ||y-U0|| 2 , 

where y = DU T y, U = DU T U. Therefore, in the case of a quantitative 
response with simplified T scores of the form X T y, regressing the scores on 
the eigenarrays is equivalent to regressing the outcomes on the eigengenes 
after rotating the outcome and the eigengenes. 

When the outcome is quantitative, working in the eigengene space is a 
reasonable alternative to working in the eigenarray space. However, in the 
case of a nonquantitative outcome, working in the eigengene space does 
not always generalize. For instance, suppose that the outcome is survival 
time. Then, for each observation, the response is a time (e.g., number of 
months that the patient survived) and a binary variable (whether or not the 
patient was censored). It is not clear how to regress this pair of numbers 
onto the eigengenes. However, using LPC, we first compute the Cox scores; 
it is then a simple matter to regress the Cox scores onto the eigenarrays. 
Therefore, working in the eigenarray space has a distinct advantage in that 
it is applicable to a wider range of outcome types. 

3. Performance of LPC. 

3.1. Simulated data. LPC outperforms existing methods in a variety 
of simulations, for quantitative, survival, two-class, and multiple-class re- 
sponses. We present results on three such simulations here. Each simulation 



(2.5) 



10 D. M. WITTEN AND R. TIBSHIRANI 



Simulation 1 Simulalion 2 ^ SimulaliDn 3 




a x> 40 «j ea iijl o s 40 M » in d to M> eo » too 



Number of Genes Called N umber or Genes Cal led Number ol Genes Called 

- Limma 
— ODP 

■ ■ LPC (Limma) 
----- LPC (ODP) 



Fig. 4. False discovery rates are shown for Simulations 1, 2 and 3 with a two-class 
response. In each simulation, 50 out of 1000 genes are significant. ODP, LPC(ODP), 
Limma and LPC(Limma) are compared. In almost all cases, the LPC methods result in 
lower FDRs than the existing methods. T and LPC(T), though not plotted, closely resemble 
the Limma and LPC (Limma) curves for all three figures. 

involves 1000 genes and 40 observations. The first 50 genes are associated 
with the outcome for each observation, and will be referred to as the sig- 
nificant genes. Simulation 1 represents the simplest case; the only structure 
present in the data is due to the differentially-expressed genes, which closely 
resemble the outcome. Simulation 2 is more complicated: in addition to the 
genes that resemble the outcome, there are three blocks of 100 genes that 
display very strong expression patterns that are orthogonal to the outcome. 
In Simulation 3, the 50 genes that are associated with the outcome are split 
into two sets of 25 genes; the signals present in each set are orthogonal to 
each other, and the response is obtained by summing the two orthogonal 
signals. Detailed descriptions of each simulation can be found in Appendix 
E, and heatmaps and R language code can be found in the Supplementary 
Materials [Witten and Tibshirani (2008)]. 

The performances of ODP [Storey, Dai and Leek (2007)] and limma 
[Smyth (2004)] are compared to the performances of the LPC method ap- 
plied to ODP and limma for each simulation (with a two-class outcome) 
in Figure 4. It is clear that the statistics are improved by applying LPC. 
Though not shown in the figure, T is also improved by applying the LPC 
method. Similar results hold for other outcome types. 

It is worth noting that the L\ constraint in the LPC algorithm is an 
important part of the reason that LPC outperforms the statistics to which 
it is applied. A reviewer inquired whether one would expect to see the same 
gains if one simply projected the initial statistics onto, say, the first k < n 
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Fig. 5. False discovery rates are shown for a simulation with seven blocks of noise genes 
that are orthogonal to the two-class outcome. LPC performs much better than simply re- 
gressing T onto the first five eigenarrays. 



eigenarrays. Consider Simulation 2 as described above (with a two-class 
response), but assume that now there are seven, rather than three, blocks 
of 100 noise genes that are orthogonal to the response. We can compare the 
LPC scores to the scores obtained by regressing the T scores onto the first 
five eigenarrays. As expected, Figure 5 shows that this latter method results 
in very poor performance, since the blocks of noise genes dominate the first 
five principal components. On the other hand, LPC does well because the 
L\ constraint leads to sparsity in the regression coefficients, and so only the 
eigenarray or eigenarrays that are related to the response are included in 
the model. 

3.2. Predictive advantage. An objection to the use of LPC might be that 
despite its performance on simulated data and its theoretical properties, it 
simply does not rank genes using the metric that is truly of interest. For 
instance, in the case of two-class data, a biologist might truly care about 
finding genes that have different mean expression in the two classes. Such 
a researcher might claim that T succinctly quantifies the concept of inter- 
est, whereas LPC finds genes that satisfy the rather abstract and perhaps 
irrelevant requirement of having T scores which, when projected onto the 
high- variance subspace of the gene expression data, yield large values. Here 
we show that in many cases, even if the conventional T scores are the quan- 
tities of interest, LPC should be used instead of T. 

Let X and X* denote independent training and test sets for the same set 
of genes. T and L are the vectors of conventional T scores and LPC scores 
on the training set, and T* is the vector of conventional T scores on the test 
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Fig. 6. E(Pi*| | > 02(a)) (red) andE(\Tj*\ \ \Tj\ > a(a)) (black) are shown for Sim- 
ulations 1, 2 and 3 with a quantitative outcome. The difference between the two quantities 
gives the predictive advantage. The predictive advantage is positive in all cases. Similar 
results hold for other outcome types. 

set. Then, we define predictive advantage as 

(3.1) E(|T/| I \L j \>c 2 (a))-B(\T j *\ \ \T j \>c 1 (a)), 

where c±(a) and 02(a) are the a quantiles of \Tj\ and \Lj\, respectively. A 
positive predictive advantage for LPC essentially means that even if T is the 
quantity of interest, then LPC should be used instead, since LPC will pick 
out genes with higher T scores on an independent test set. If LPC has a 
positive predictive advantage on a given data set, then there is no question 
that LPC is superior to T on that data set. 

On data that we examined, the predictive advantage is often positive 
(Figures 6 and 7). The simulated data for Figure 6 has a quantitative out- 
come; for each simulation, the predictive advantage is positive. Two of the 
data sets used for Figure 7 have survival outcomes; they are the lymphoma 
data set from Rosenwald et al. (2002) and the kidney cancer data set from 
Zhao et al. (2006). The third data set is the two-class colon cancer data set 
of Alon et al. (1999). The predictive advantage of LPC is positive for the 
survival data sets, and is positive after more than the first few genes have 
been selected for the two-class data set. 

The predictive advantage provides a "quick and dirty" approach to ver- 
ifying that LPC is uniformly better at identifying significant genes on a 
given data set. We recommend the use of LPC instead of competing meth- 
ods whenever its predictive advantage relative to the competing methods is 
positive. However, the predictive advantage of LPC was not positive for all 
of the data sets that we considered. 

3.3. Estimated false discovery rates for LPC. Estimation of false discov- 
ery rates for LPC is surprisingly difficult. Due to the fact that the LPC 
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statistic for a given gene is a function of all of the genes present in the data 
set, the standard method of estimating false discovery rates by permuta- 
tions cannot be applied. More on this topic, as well as the method that we 
developed to estimate false discovery rates for LPC and other functionally 
dependent statistics using predictive advantage, can be found in Appendix 
C. 

We apply our method of estimating false discovery rates for LPC to 
the kidney cancer and lymphoma survival data sets of Zhao et al. (2006) 
and Rosenwald et al. (2002), as well as to the two-class colon data set of 
Alon et al. (1999). For the survival data sets, we compare LPC to T, and 
for the two-class data set, we compare T, limma, ODP and LPC. Figures 3 
and 8 show that the estimated FDR of LPC is less than those of the other 
methods, with the exception of the first 15 or so genes for the colon cancer 
data set. The erratic performance of LPC for these first genes is a direct 
consequence of the fact that the predictive advantage of LPC is negative for 
these first genes in this data set (see Figure 7). 

As discussed in Appendix C, our method of estimating FDRs for LPC 
involves computing FDRs for simpler scores that can be estimated through 
permutations, and then estimating the difference in FDR between LPC and 
those simpler scores. Figure 9 displays the mean estimated difference be- 
tween the FDR of T and the FDR of LPC, as well as standard errors for 
this estimate, for the colon, kidney, and lymphoma data sets. The figure was 
obtained by repeatedly sampling 90% of the observations, without replace- 
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Fig. 7. E(|Tj*| | \Lj\ > c 2 (a)) (red) and E(|T/| | \Tj\ > Ci(a)) (black) are shown for 
two survival data sets: the kidney data [Zhao et al. (2006)] and the lymphoma data 
[Rosenwald et al. (2002)]. In both cases, the predictive advantage of LPC is positive. The 
two-class colon data [Alon et al. (1999)] is also shown; in this case, the predictive ad- 
vantage is positive when more than around 15 genes are considered. The kidney data set 
involves survival times and expression measurements on 14,814 genes for 177 patients. As 
mentioned earlier, the lymphoma data set involves survival times and expression measure- 
ments on 7,399 genes for 240 patients. The colon data consists of cancer status and gene 
expression measurements on 2, 000 genes for 62 patients. 
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Fig. 8. Estimated FDR for two published microarray data sets: the Zhao et al. (2006) 
kidney cancer survival data set and the Alon et al. (1999) colon cancer two-class data set. 



ment; for each resulting sample, the difference in FDR between T and LPC 
was computed. The figure shows that for these three data sets, the estimated 
FDR of T is significantly higher than that of LPC (with the exception of 
the first few genes in the colon data set). 

3.4. Another look at LPC versus T for survival data. First, we examine 
more closely the differences between the genes selected as significant by LPC 
and T (in this case, Cox scores) in the kidney cancer data set of Zhao et al. 
(2006). When LPC is applied to this data set, several eigenarrays (out of 
177) are assigned nonzero coefficients; two of these, eigenarrays 2 and 4, 
have large absolute coefficients. Figure 10 shows the loadings of the genes 
on these two eigenarrays. The top 1% of genes selected by LPC and the 
top 1% of genes selected by Cox scores are shown. Though there is some 




Fig. 9. The mean estimated difference in FDR between T and LPC is shown in black, as 
a function of the number of genes called significant, for 20 data sets created by sampling 
90% of the observations from the original data set without replacement. An estimate of 
one standard error above and below this mean is shown in red. A positive mean difference 
indicates that T has higher FDR than LPC. 
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Genes Chosen by LPC and Cox 
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Fourth Eigenarray 

Fig. 10. The loadings of the kidney cancer genes on the second and fourth eigenarrays 
are shown. The top 1% of genes selected by LPC are shown in red, and the top 1% of genes 
selected by Cox score are shown in green. The remaining genes are in black. There is some 
overlap between the genes with the highest Cox scores and the genes with the highest LPC 
scores. 

overlap between these two gene sets, it is clear from the figure that LPC 
selects genes with extreme loadings on these two eigenarrays, whereas the 
Cox score uses a different criterion. 

The genes with the highest LPC scores have high loadings on eigenar- 
rays 2 and 4; this means that they have expression that is correlated with 
eigengenes 2 and 4. Figure 11 shows the top genes selected by LPC and not 
by Cox score, the top genes selected by Cox score and not by LPC, and the 
fourth eigengene. It is clear that the top genes selected by LPC and not by 
Cox score resemble this eigengene more closely than do the genes selected 
by Cox score and not by LPC. 

We established earlier that in the kidney cancer and lymphoma data sets, 
LPC has a positive predictive advantage relative to Cox scores. We now 
examine a related concept, the ability of the top genes ranked by LPC and 
Cox scores to predict survival. We split the kidney and lymphoma data sets 
into a training set and a test set, and identified the 25 genes with highest 
Cox and LPC scores on the training set. We fit Cox proportional hazard 
models to the test data, using the top i = 1, ... ,25 genes ranked by each 
scoring method. We then computed the medians of the log rank statistics 
obtained from each of these models over 20 iterations. The results can be 
seen in Figure 12. Models that use the top genes ranked by LPC outperform 
models that use the top genes ranked by Cox scores when not too many 
genes are included in the model. However, when enough genes are included 
in the model, then for both data sets, genes with high training set Cox scores 
eventually lead to superior models. This is due to the fact that LPC assigns 
high scores to correlated sets of genes, and including additional correlated 
predictors in a model does not lead to much improvement. On the other 
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Fig. 11. For the kidney data set, the top 29 genes selected by LPC but not by Cox score 
are shown on the top left, and the top 29 genes selected by Cox score but not by LPC are 
shown on the top right. On the bottom left and the bottom right, the fourth eigengene is 
shown. The observations are ordered from left to right by the size of their loading in the 
fourth eigengene. The top 29 genes selected by LPC resemble the fourth eigengene more 
closely than do the top 29 genes selected by Cox score. 

hand, using Cox scores results in the addition of genes to the model that 
may be less correlated with each other, and so adding a greater number of 
such genes leads to greater improvement. This example demonstrates that 
while we have shown that LPC is more successful than competing methods 
at identifying significant genes, it is not ideally suited for model selection if 
a large model is desired. 

3.5. Loss of power if assumptions are incorrect. Throughout this paper 
we have motivated LPC by arguing that, for gene expression data, genes 



Lymphoma Data Kidney Data 




& 10 IS J» 25 s 10 15 SO N 

Number of Genes In Model Number of Genes In Mocfel 

Fig. 12. Cox proportional hazard models were fit to the test data, using the 1 through 
25 genes that received highest Cox/LPC scores on the training data. When the number of 
genes in the model is low, the models fit using genes with high LPC scores had higher log 
rank statistics that those fit using genes with high Cox scores. LPC is red and T is black. 
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work together in pathways, and a clinical outcome is likely to be caused not 
by a single gene, but rather by a set a correlated genes. An obvious question 
arises: How much power to detect significant genes will we lose using LPC 
if the response is caused by a single or a few genes, rather than by a large 
pathway of genes? In this case, we might expect that univariate gene scores 
would correctly identify the significant gene, whereas LPC applied to those 
gene scores would identify the significant gene, along with other correlated 
but unimportant genes. 

As an example, we randomly selected 500 genes from the colon cancer 
data set, and performed /c-means clustering on the genes with k = 50. We 
simulated a quantitative response y as the centroid of the cluster containing 
the greatest number of genes, plus noise (2N(0, 1)). The resulting predictive 
advantage of LPC over T was positive. We then repeated the experiment, 
but this time simulated the response as the centroid of the cluster containing 
the fewest genes, plus noise. The resulting predictive advantage of LPC over 
T was negative. The predictive advantages for the two cases are displayed 
in Figure 1 in the Supplementary Materials [Witten and Tibshirani (2008)]. 

This example suggests that LPC has the potential to outperform T in 
cases where the response is caused by a large pathway of correlated genes; 
however, if this condition is not satisfied, then T may do better. As pointed 
out by an editor, it is likely that LPC will perform well in cases such as 
tumorigenesis that involve large perturbations to molecular pathways; on 
the other hand, LPC may not perform as well when more subtle changes in 
gene expression are present. If one does not know whether the use of LPC 
is warranted in a given situation, a simple plot of the predictive advantage 
will indicate whether LPC provides a benefit over T. 

4. Discussion. Many of the ideas in this paper build upon the existing 
literature. Eigenarrays and eigengenes have been used many times before 
in the analysis of microarray data; for instance, Alter, Brown and Botstein 
(2000) decompose microarray data into eigengenes and eigenarrays in or- 
der to obtain a "global picture of the dynamics of gene expression," and 
Bair and Tibshirani (2004) and Bair et al. (2006) use eigengenes in order 
to predict patient survival time. Leek and Storey (2007) use the eigengenes 
of microarray data in order to infer and remove expression heterogeneity 
that is unrelated to the outcome. (More on the relationship between LPC 
and surrogate variable analysis can be found in Appendix D.2.) In addition, 
Shen et al. (2006) make use of the singular value decomposition for the se- 
lection of genes to use in tumor classification. Unlike LPC, Shen et al. (2006) 
seek genes that are uncorrelated with each other and capture the variation 
in the data, rather than genes that are associated with the outcome. To the 
best of our knowledge, we are the first to propose regressing simple gene 
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scores onto the eigenarrays in order to obtain gene scores that make use of 
the data's correlation structure and association with the outcome. 

The notion of using latent factors and sparse representations in order to 
model biological pathways is well developed in the statistical literature. In 
particular, West (2003) and Carvalho et al. (2008) use a Bayesian approach 
to fit sparse factor models to gene expression data. They demonstrate that 
these sparse factors are easily interpreted, and can lead to the discovery and 
validation of biological pathways. Their method can also be used for gene 
selection: if a few factors are associated with the outcome, then this suggests 
that only genes with nonzero loadings in those factors are related to the 
outcome. Our approach is different, in that we use "empirical factors" rather 
that attempting to model the factors themselves in a sparse way. It would 
be of interest to combine these two approaches by regressing univariate 
statistics onto these sparse factors, rather than onto the eigenarrays of the 
data as described here. 

LPC takes advantage of the structure of the entire microarray data set 
in order to improve accuracy of gene scores. In order to achieve this same 
goal, one might instead consider the use of a full multivariate approach. For 
instance, one could regress a quantitative outcome onto the gene expression 
matrix using an L2 (ridge) penalty. The resulting coefficient for each gene 
could be treated as a measure of its differential expression. However, sup- 
pose that some genes in the expression data matrix are highly correlated 
with each other and with the outcome. Ridge regression would assign to all 
of these genes similar coefficients that are smaller in magnitude than the 
coefficients that they each would receive in a univariate regression. In effect, 
ridge regression would decrease the estimate of a gene's significance as a 
consequence of the presence of correlated genes. Similarly, one could regress 
the outcome onto the data matrix with an L\ penalty. Due to the sparse 
nature of the L\ solution, this would result in only one gene out of a corre- 
lated set of genes receiving a nonzero coefficient. From these examples, it is 
clear that such multivariate approaches are not well suited to the problem 
of identifying differentially-expressed genes. 

An attribute of LPC is that its use is not limited to the identification 
of significant genes in microarray data. Rather, it can be applied to a wide 
range of data types, provided that the hypothesis that significant features oc- 
cur in correlated sets is justified. Interestingly, a recent paper by Price et al. 
(2006) makes use of a technique that is similar to LPC for a completely 
different reason. In the analysis of genome-wide association data, in order to 
identify single-nucleotide polymorphisms (SNPs) that are associated with a 
given phenotype, Price et al. (2006) suggest "adjusting" the phenotypes and 
genotypes by regressing them onto the principal components of the genotype 
data, and using the residuals to compute a x 2 statistic to measure the asso- 
ciation between a given SNP and the phenotype. They do this to remove the 
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effects of ancestry from the data. In effect, LPC seeks features that are asso- 
ciated with the principal components of the data, whereas Price et al. (2006) 
do the opposite. This is due to the two very different biological processes 
underlying the two types of data. In the case of microarray data, we seek 
groups of genes that are correlated with each other and with the outcome; 
in the case of genome-wide association data, we seek a single causative SNP 
or set of interacting SNPS that affect disease risk. 

In practice, the LPC method can be used in two ways. The easiest ap- 
proach, which is discussed throughout most of the paper, is to apply it to 
simple T scores. As shown here, the resulting scores can result in lower 
FDR than much more complicated scores, such as ODP. Another possibil- 
ity is that the LPC method can be applied directly to more complicated 
scores such as ODP. (This was done in Figure 4.) This can lead to even 
greater decreases in FDR than simply applying LPC to T, and is an attrac- 
tive option in situations where there is no need to explain the gene scores 
to a nonstatistical audience. Though LPC outperforms T on the simulated 
and real data examples that are shown here, it does not always provide an 
improvement over T. For microarray data, we recommend the use of LPC 
instead of a competing method whenever its predictive advantage, relative 
to the competing method, is positive. 

An R software package will be made available. 

APPENDIX A: SELECTION OF THE SHRINKAGE PARAMETER 

The LPC method involves a tuning parameter, A. We choose the value of 
A via cross-validation, as follows: 
Selection of A: 

1. Compute V, the matrix of eigenarrays of X. 

2. Split the observations into a training set and a test set. 

3. For a range of values of A: 

(a) Compute L^traiin which are similar to the LPC scores described in 
the LPC Algorithm, except that the matrix V on which the T tra m 
scores are regressed is taken from Step 1 above. In other words, we 
regress the training set T scores onto the eigenarrays for the entire 
data set. 

(b) Compute the average |T tcs t| of the 50 genes with the highest |L^ )train | 
scores. 

4. Repeat steps 2 and 3 ten times, recording the average |T t0St | scores ob- 
tained in step 3(b). 

5. Choose the value of A that results in the highest average |T test | of the 50 
genes with the highest (Lorain I scores. 
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We regress the training set T scores onto the eigenarrays of the full data 
set, rather than onto the eigenarrays of the training data set, for two reasons. 
First of all, this leads to a much faster algorithm, as it means that we are 
required to compute the SVD only once, for the full data set, rather than 
once for each of the 10 training sets that we produce. In addition, if we were 
to regress the test set T scores onto the eigenarrays for the training set, we 
would not be guaranteed that those eigenarrays resemble those of the full 
data set. Therefore, it would not be clear that the optimal parameter value 
for the split data sets is also best for the full data set. Note that we are 
not over-fitting the data by using the eigenarrays for the full data set, since 
computing the eigenarrays does not involve the test set y values. 

Our method of choosing A is closely related to the concept of predictive 
advantage, presented in Section 3.2. It is worth noting that in most of our 
simulations, LPC's performance was not very sensitive to the choice of A: 
for a wide range of A values, LPC outperformed T by a comfortable margin. 

APPENDIX B: LPC FOR A MULTIPLE-CLASS RESPONSE 

In the case of a multiple-class outcome with K classes, we compute LPC 
scores in a slightly different fashion from the method described in the main 
text: 

LPC Algorithm for Multiple-Class Outcome: 

1. Compute the contrast Skj for each class k (which we will call Cf~) and for 
each gene j: S^j = XCfcJ Xj ; where Sj is the standard deviation for gene 
j. Sfc denotes the vector of contrasts for class k. 

2. Compute vi, . . . , v n , the eigenarrays of X, where the Vj's are the columns 
of the matrix V in the singular value decomposition (SVD) of X, X = 
UDV T . 

3. For some value of the parameter A, fit the model = /3 + Ya=i A v ij 
where the vector (3 with components 0i is chosen to minimize the quantity 
(S fe - /3 - V/3) T (S fc - (3 - V/3) + A Yd=i 

4. Let Skj denote the fitted values obtained by the above model for class k 
and gene j. The LPC score for gene j is J2k=i &kj- 

Note that the two-class case really is a special case of the multiple-class 
case with K = 2, where the LPC score for gene j is given by Su. 

APPENDIX C: ESTIMATION OF FALSE DISCOVERY RATE FOR 

LPC 

C.l. The problem of functionally dependent statistics. Estimation of the 
false discovery rate for the LPC procedure is surprisingly difficult. In study- 
ing this issue, we learned a more general fact about FDR estimation, which 
we discuss here. This issue is also addressed in Getz et al. (2007). 
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Table 2 

Possible outcomes from a multiple testing problem 









Predicted 






Null 


Nonnull 


True 


Null 


A 


B 




Nonnull 


C 


D 



False discovery rates are often estimated by permutations, which are sim- 
ulated from a null distribution. This is done, for example, in the SAM pro- 
cedure [Tusher, Tibshirani and Chu (2001)]. It turns out that the validity 
of this procedure relies on the fact that the statistic for gene j is a function 
of only the data for gene j. This is violated for the LPC score, which is 
functionally dependent on all of the data for all of the genes. 

Consider the usual testing scenario, with outcomes given in Table 2. The 
FDR is defined as ~E(B/(B + D)). This expectation is taken with respect 
to the population of genes, which are null and nonnull; it is not taken with 
respect to a null distribution. 

Now, null simulations try to estimate the FDR by simulating data from 
the top row of this table, counting the average number of false positives 
B* , and then using the approximation FDR = (ave jfB*) j \B + D) . This as- 
sumes that E(-B) « E nu n(i?) (the expectation under the null). This is true 
for statistics that are functions of just one gene, but is not true for function- 
ally dependent statistics. In the latter case, the interplay between null and 
nonnull genes creates a large bias in E nu n(i?) as an estimate of ~E(B). 

Figure 2 in the Supplementary Materials [Witten and Tibshirani (2008)] 
shows a simple example with p-values. We generated 1000 p-values with a 
£7[0, 1] distribution and then set the first 50 to 10~ 6 . These are shown in the 
top left panel: the spike on the left are the nonnull p-values. We generated 
1000 new p-values from U[0, 1], shown in the top right panel. As we expect, 
their distribution looks like that of the null p- values in the top left. For 
illustration, we chose a cutoff of to — 0.054, the 10% point of the histogram 
in the top left panel. Hence, 100 p-values in the top left panel are less than 
0.054. We simulated 200 versions of the top right panel, and the average 
number of p-values that were below to was 54.4. Hence, the estimated FDR 
is 54.4/100 = 0.544. The true FDR is 50%, since only one-half of these genes 
are nonnull. As we expect, the estimated FDR is close to the actual FDR. 

As an illustration, suppose that we instead use a simple transformation 
of the p-value, 



(C.l) 



CaMPj = -log 10 {np j /q j ), 
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where qj is the rank of the pj among all of the p-values. This score was 
used in Sjoblom et al. (2006): it has some appeal from an interpretability 
viewpoint, but also creates a functional dependence between the p-values. 
The bottom two panels of the figure show what happens when we repeat 
the exercise with CaMP^ instead of pj. (Note that large values of CaMPj 
indicate significance.) The histogram in the bottom right is shifted to the left 
compared to that for the null CaMP scores in the bottom left panel. Using 
a 90% cutoff of 0.236 for the CaMP score, the estimated FDR is 0.017, far 
less than the actual FDR of 50%. 

What has happened? The CaMP scores for the null genes behave differ- 
ently depending on whether or not there are nonnull scores in the data, be- 
cause the scores are functionally dependent. The presence of nonnull scores 
in the bottom left panel tends to inflate the scores for the null genes. 

This same phenomenon occurs if we use permutations to estimate the null 
distribution for an arbitrary statistic that is a function of multiple genes. If 
the permutation approach is used, the bias in the estimate of the FDR can 
result in over-estimation or under-estimation of the true FDR. The LPC 
statistic is a functionally dependent statistic, since the data for all genes 
is used to estimate the principal components. Therefore, the permutation 
approach is not suitable for the estimation of the FDR for LPC. 

C.2. Estimating FDR via predictive advantage. The FDR of T can be 

easily estimated via permutations. Therefore, it makes sense to try to as- 
sess the FDR of the LPC statistic relative to the FDR of T. We take that 
approach in this section. 

It makes sense intuitively that an estimator with higher predictive advan- 
tage (3.1) will also tend to have lower FDR. In this section we show that 
under a simple shift model, the FDR of a statistic is lower than that of T if 
the statistic has higher predictive advantage. Second, we derive an estimate 
of the FDR of LPC based on an adjustment of the FDR of T. These results 
hold for any functionally dependent statistic, not just the LPC score. 

We use the same notation as in previous sections. Let Tj be the T score 
for gene j on the training set, and Lj the LPC score for gene j on the 
training set. On the test set, the T score for gene j is denoted Tj* . Let c\(a) 
and 02(a) denote the a quantiles of Tj and Lj. Also, let Tj = Uj + Zj and 
Tj* = Uj + Zj* , where Uj , Zj and Zj* are independent, Uj = if gene j is null, 
and Zj and Zj* are identically distributed. 

We wish to show that 




P{Tj* > c\Lj > c 2 (a)) > P{Tj* > c\Tj > ci(a)) 



P{uj = 0\Lj > c 2 (a)) < P(uj = 0\Tj > ci(a)). 



LASSOED PRINCIPAL COMPONENTS 23 

Now, 

P(Tj* > c\Lj > c 2 (a)) 

= P(Tj* > c\Lj > c 2 {a),Uj = 0)P(uj = 0\Lj > c 2 (a)) 
(C.4) + P(Tj* > c\Lj > c 2 (a),Uj / 0)P( Uj / 0\Lj > c 2 [a)) 

= P(Tj* > c\Lj > c 2 (a),Uj = 0)P{uj = 0\Lj > c 2 {a)) 

+ P(Tj* > c\Lj > c 2 (a), Uj ^0)[1 - P( Uj = 0\Lj > c 2 (a))). 

So, 

P(uj =0\Lj > c 2 (a)) 

(C.5) 

P(Tj* > c\Lj > c 2 (a), Uj / 0) - P(Tj* > c\Lj > c 2 (a)) 
~ P(Tj* > c\Lj > c 2 (a),Uj ^ 0) - P(Tj* > c\Lj > c 2 (a),Uj = 0) ' 

Now, we make the additional assumption that Uj = 1 if a gene is nonnull 
(recall that we have already assumed that Uj = if a gene is null) . It follows 
that 

fn R \ v( nir ^ f ^ P(T/ > c\ Uj = 1) - P(T/ > c\L, > c 2 (q)) 
(C.6) P(u j = 0\L j >c 2 (a))= P(r/>c | M , = 1) _ P(r/>c | u . = 0) • 

Similarly, we can expand P(Tj* > c\Tj > ci(a)) to find that 

P(Tj* > c\ Uj = 1) - P(Tj* > c\Tj > ci(a)) 



(C.7) P(u j = 0\T j >c 1 (a)) 



P(Tj* > c\uj = 1) - P(Tj* > c\uj = 0) 



Note that (C.6) and (C.7) have the same denominator, which is positive. 
And, from (C.2), we know that the numerator of (C.6) is less than the 
numerator of (C.7) (and the numerators are positive). So, (C.7) is greater 
than (C.6); in other words, 

(C.8) P(uj = 0\Tj > ci(a)) > P(uj = 0\Lj > 02(a)). 

So, the FDR of LPC is bounded above by the FDR of T. This result 
assumes that there is only one parameter value in the alternative space, and 
does not exactly hold in the composite case. 

To estimate the FDR, we use (C.6) and (C.7). We approximate the dif- 
ference in FDRs by 

P(uj = 0\Tj > ci(a)) - P{uj = 0\Lj > c 2 (a)) 

(C.9) 



(1 - vro) • 



P(T* > c\Lj > c 2 (a)) - P(T* > c\Tj > ci(a)) 



P{T* > c) - P(T* > c\ Uj = 0) 
where ttq is the proportion of genes in the data set that are not significant. 



24 



D. M. WITTEN AND R. TIBSHIRANI 



We deal with c in the computations by using the relation E(X) = /o°(l — 
F{x))dx for a positive random variable. Hence, we average the quantities 
on the right-hand side (e.g., T*). 

Figure 3 in the Supplementary Materials [Witten and Tibshirani (2008)] 
shows the estimated FDRs for LPC and T for Simulations 1, 2, and 3 with a 
quantitative response variable. The estimate of FDR for LPC [using (C.9)] 
is pretty accurate. We have found that it behaves fairly well in general, and 
it tends to err in the conservative direction. 

APPENDIX D: AN UNDERLYING LATENT VARIABLE MODEL 

D.l. Model and results. Assume that the data are generated under the 
following model, in which the response y is quantitative: 

Vi = 00 + fiiun+ Si, 
(D.l) Xij = a j + ciaijUn + c 2 a 2j u i2 + Zij, 

E(si) = E(zij) = EieiZij) = 0, 

where P denotes the set of important genes, and ay = if j ^ P, a%j 7^ 

if j € P. Let ui = {u\\ ■ ■ ■ u n \) T and U2 = (ui2 • • • u n 2) T be orthonormal 
with mean zero. Similarly, let ol\ = (an • • • a\ p ) T and cx 2 = (a 2 \ ■ ■ ■ a 2p ) T 
be orthonormal with mean zero. Assume that y has been centered to have 
mean zero so that /?o = 0, and that the genes are centered, so that «oj = 0. 
Also, assume that the Z{j £1X6 independent and identically distributed for all 

1 G 1, ... ,ri and j € 1, . . . ,p, and that the £j are independent and identically 
distributed for all i € 1, . . . , n. 

First, we consider simplified T scores that take the form Tj = Xjy for 
gene j (these simplified scores are equivalent to the usual T scores one would 
obtain if one first scaled the columns of X appropriately). Then, 

E(T i ) = E(Xjy) 

n 

(D.2) = ^E((ciaijUii + c 2 a 2j u i2 + z^^iUu + £*)) 

i=l 

= ciaijPi, 

because E(zij) = E(ej) = E(%£j) = 0, and ui and U2 are orthonormal with 
mean zero. 

Now, suppose that we ignore the error associated with estimation of the 
principal components, so that we estimate some eigenarray v of the data 
matrix X to equal cx\ = (an • • • ai p ) T from the underlying model. We can 
regress X T y onto v= {v\ ■ • -v p ) in order to obtain LPC scores: 

(D.3) f = (X T y,v)v, 



LASSOED PRINCIPAL COMPONENTS 



25 



where (•, •) denotes inner product, and T = (7\ • • • T p ) T . We want the expec- 
tation of Tj\ 

(D.4) E(Tj) = E(^-(X T y, v» = ayciftafati = = E{Tj). 

Now, to find the variance of Tj, 

Var(^)=Var(t; J (X T y,v)) 
= ah Var(a 1 T X T y) 

(D.5) 

= a lj a 1 Var(X yjcti 
= aljcej Var(T)o:x- 

Now, 

(D.6) Var(T) = (c? aiaf + c^c^) Var(e;) + I pxp (pf Var(jzy) + Var(zje)) 
and so 

Var(Tj) = c^a^- Var(ej) + a\fi\ Var(zjj) + af • Var(zje), 

(D.7) Var(T i ) = (cfa^- + c 2 2 a 2 2j ) Var(^) + $ Var(^) + Var(zje), 

Var(rj) - Var(fj) = <? 2 a\ Var(e;) + (1 - a\j)\fi\ Var(z 4i ) + Var(zje)]. 

Therefore, using Tj rather than Xjy to rank the significant genes results 
in a decrease in variance (note that we have assumed that v = ol\ ; we are 
ignoring the variance associated with the estimation of the eigenarrays) . 
Equation (D.4) indicates that the LPC scores and the T scores have the 
same expectation. 

In this example, only one latent factor (ui) is related to the response. As 
pointed out by a reviewer, the argument does not extend exactly to the case 
of multiple latent factors. 

Details of the calculations performed in this section can be found in the 
Supplementary Materials [Witten and Tibshirani (2008)]. 

D.2. Relation to surrogate variable analysis. In a recent paper 
Leek and Storey (2007) present surrogate variable analysis (SVA), a novel 
method that incorporates sources of expression heterogeneity into microar- 
ray data analysis in order to increase power and mitigate spurious signals in 
genes that are unrelated to outcome. An editor noted that there are similar- 
ities between SVA and LPC. While both methods involve using the singular 
value decomposition in order to identify structure in the data shared by the 
response and the gene expression measurements, they have different effects. 

Consider, as an example, the result of applying SVA to the data gen- 
erated under our latent variable model (Appendix D.l). We use the same 



26 



D. M. WITTEN AND R. TIBSHIRANI 



notation as in that section. SVA first detects "unmodeled factors" in the 
gene expression data that are not explained by the model 

(D.8) Xij = Hj + fjiyi) + Eij. 

Let us assume that SVA is able to correctly identify the u 2 term as a source 
of expression heterogeneity unrelated to the outcome. Then, in subsequent 
analysis, Leek and Storey (2007) suggest the use of the modified data 

(D.9) X.*=X j -c 2 a 2j U2, 

which is what remains after removing the surrogate variable that we esti- 
mated. The conventional scores using the modified data (call them T*) will 
take the form 

T* = Xf y 

(D.10) =(X i -c 2 a 2j u 2 ) T (/3 1 u 1 + e) 

= (ciaij-ui + Zj) T (/3iUi + e). 

It is clear that E(T*) = E(Tj). As a consequence of our assumption that 
we were able to successfully identify the surrogate variable u 2 , it follows 
that Var(T/) = c\a\j Var(^) + fi\ Var(zy) + Var(zJ<=:) < Var(Tj-). So, under 
our model, SVA has successfully reduced the variance of the T scores. 

However, at this point, the SVA scores T* can be regressed onto the 
eigenarrays of the original data, as in the LPC method, in order to obtain 
a reduction in variance as in Section D.l. In fact, by the argument in the 
previous section, the variance of LPC applied to T* is the same as that of 
LPC applied to T (and is less than the variance of T*). Therefore, in this 
example, LPC provides benefits in variance reduction that are beyond those 
given by SVA. In general, the LPC method can be applied on top of gene 
scores obtained after expression heterogeneity has been removed using SVA. 

APPENDIX E: SIMULATION DETAILS 

E.l. Description of simulations. Here, we describe in greater details the 
simulations used in Section 3.1. 

Our first simulation represents the simplest case: all of the signal in the 
data is associated with the outcome. 
Simulation 1: Simple Case 

1. In observations 1-20, fji ~ iV(6, 1), and in observations 21-40, yi ~ N(5, 1). 

2. In observations 1-20, ~ N(2, 1) for j < 50. Otherwise (for i > 20 and 
for j>50), X^~iV(0,l). 

In our second simulation the significant genes have less signal than large 
blocks of genes that are unrelated to the outcome. 
Simulation 2: Blocks of Noise Genes 
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1. In observations 1-20, yi ~ iV(12.5, 1), and in observations 21-40, y« ~ 
iV(10,l). 

2. In observations 1-20 and in genes 1-50, X™ ~ iV(1.5, 1). 

3. For all other i, j, we let JQ,- ~ iV(0, 1), with the following exception: there 
are three blocks of 100 genes each that are distributed N(2, 1) or N(—2, 1) 
in 10 observations; these three blocks of genes are orthogonal to the signal 
present in the outcome and in the first 50 genes. 

The blocks of noise genes represent groups of genes with strong expression 
patterns that are unrelated to the outcome of interest. 

In our final simulation there are two sets of 25 significant genes that have 
orthogonal expression patterns. The quantitative outcome is the sum of the 
expression patterns in the two sets of significant genes. 
Simulation 3: Two Sets of Orthogonal Significant Genes 

1. For observations 1-10, yi ~ iV(10, 1); for observations 11-30, ?/, ~ 1), 
and for observations 31-40, yi ~ iV(12, 1). 

2. For genes 1-25, Xy ~ N(0, 1) for observations 1-20 and ~ N(2, 1) for 
observations 21-40. 

3. For genes 26-50, X^ ~ iV(0, 1) for observations 1-10 and 21-30. For ob- 
servations 11-20 and 31-40, X tj ~ N(2, 1). 

4. For genes 51-1000, X tj ~ N(0, 1). 

Heatmaps for each simulation can be seen in Figure 4 in the Supplemen- 
tary Materials [Witten and Tibshirani (2008)]. The simulations described 
above have a quantitative outcome; two-class outcomes are obtained using 
an indicator variable for whether the outcome for a given observation is 
greater than or less than the median outcome across all observations. 
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SUPPLEMENTARY MATERIAL 

Supplementary materials: Testing significance of features by lassoed prin- 
cipal components (DOI: 10. 1214/08- AOAS182SUPP; .pdf). R code for sim- 
ulations, details of variance derivations for latent variable model and sup- 
porting figures. 
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